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1, INTRODUCTION 

Natural language processing has been an interesting area of research in machine learning. Artificial 
intelligence provided to the machines enables them to cope up with the native languages used by the humans. 
Complexity of the native languages is one of the most challenging problems to deal with the Natural Language 
processing. To design intelligent machines machine learning technique neural network can be used [1]. Unlike 
computer language keywords, meaning of the keyword changes with sentences in native languages where 
ambiguity is at the peak. Semantic analysis can be done with the help of corpus associated with the language. 

India is a multilinguistic Country where in each state speaks different language. Language boundary 
and cultural differences make its beauty in diversity. With 22 major languages, written in 13 different scripts, 
with over 720 dialects, India stands to be one of the largest multilinguistic countries in Asia. Malayalam, Tamil 
and Telugu are the prominent languages in South India. Malayalam is native language of the state Kerala 
spoken by 38 million people, Kannada, the native language of Karnataka and Tamil, native language of Tamil 
Nadu and also official language of two other countries Singapore and Sri Lanka. Tamil is spoken by a total 70 
million people. Apart from these languages, English has become the common language spoken in India. 

In the earlier stages of computers only English language were widely used in the documents, emails 
and messages. To make computer adaptable to all sectors of people, even somebody who does not know 
English, only way out was to make computer enabled with native languages. Introduction of fonts in native 
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languages enabled its usage widely. Usage of Computers and Mobile Apps became widespread. Machines 
Translations played a vital role in converting any language to any other language. Translations are done with 
the help of dictionaries or wordnet of language, exact word by word translations may end in changing the entire 
meaning of the context. 

Nal ajao0al is the Malayalam word that represents a kind of breakfast in South India, many of the 
translators available to provide translation to English interpret the word as subheading or upright flour. The 
major Challenge dealing with the native language is the degree of ambiguity that add on to every context 
making the accuracy to fall below considering word by word translations. Different architectures for 
performing translation task is Rule Based Machine Translation and Statistical Machine translation [2]. 

Current generation saw the need of transliteration than translation. People were accustomed to English 
Writing and Reading than their native language, where in English words took most its place. Prominence of 
English words in our day to day life is so high that it became convenient in substituting the words in native 
language. But people who were well versed in speaking native language but not that well versed in writing or 
reading started using native language typed in English which 1s called as Transliterations for ease of 
communication. Fact that humans are more comfortable in their Natural Language when it comes to expression 
of words has its application in this context. There are basically three approaches for transliteration. They are 
based on grapheme, Phoneme and Hybrid Approaches. In grapheme approach, it directly transforms grapheme 
from source to target. In Phoneme model the key is pronunciation of source language. Hybrid model uses both 
the grapheme and phoneme model information. 

Transliteration can be dated back to 1994 where major work was in the area of Arabic-English [3]. A 
generative model for back transliteration from English to Japanese was proposed in 1997[4].Mathematical 
approximation technique using statistical model was used in English Korean Transliteration in the year 2000 
[5]. An automatic character alignment method for English word and Korean transliteration is discussed in [6]. 
In year 2002, a hybrid model [7] was built on phonetic and spelling mappings using Finite state machines. 
Transliteration of Arabic names in to English was done by this method. In 2004, a new framework allowing 
direct orthographic mapping (DOM) between two different languages, through a joint source-channel model, 
also called n-gram transliteration model (TM) was introduced [8]. It generates probabilistic orthographic 
transformation rules using a data driven approach. Phonemic interpretation, level is skipped, so that the error 
rate in transliteration is reduced significantly. 

Sample Transliteration: 


Nal aja0a01 will be typed in English as Uppumavu. 


Cross Linguistic is the usage of multiple languages in the same text. This effect is due to the influence 
of other languages especially English in their native language. Cross linguistic and Transliterations are the two 
issues that have to be addressed in the analysis of Social media text. When it comes to data analysis than 
language boundaries meaning of the data matters. In application like analysing the review of the products, on 
mining the web, we may have to analyze reviews in different languages, transliterations about the same product 
etc. So it 1s important to identify to which language the text belongs to in order to understand meaning in the 
text. Identification of the language in the Indian scenario is one of the toughest jobs when concerned with 
number of existing languages. Usage of transliteration in the social media text had made the problem 
even worse. 


2. ALGORITHM FOR LANGUAGE IDENTIFICATION IN CROSS LINGUAL AND 
TRANSLITERATION TEXT 

In this paper we describe algorithmic stop words based model for the identification of particular 
language in a text of conversation. Stop words are basically the most common words used inside a language. 
Procuring of the appropriate data set, 1s achallenging task. Social media text can be either transliterated or it 
can be mixture of multiple languages. The algorithm identifies the language with respect to three languages 
used inside the text, Malayalam, Tamil and English.Several machine learning algorithms are used for the 
categorization of languages in a multilinguistic approach.Category of the classification algorithm ranges from 
the simple naive bayesian approach to complex deep learning algorithms.Hybrid methodology is also followed 
to bring out best features among supervised algorithm and unsupervised algorithm.Stop word based model is 
simple method that divides the text in to language bags based on the stop words. 


Algorithm for Stop word-based Language Detection Model 
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1.Remove the content E from I, Where I is the input text and 
E= oo we elles !, @,#,$,%,,&,*,(,),_,-.=,+,emotional icons } 


2.Divide the sentence S in to set W, where W is set of unique words in S 
3.Invert the case of W to form set w, where w € lowercase(W) 


4. For each wjelement of w, sj element of S, where S is the set of transliterated stopwords of all languages 
for i= 1 ton 
strcmp (wi, Si) = ki, kiis the match for each language Li. 


4. Find K=> ki 
5. If Ki= count (Zi), Zi is the count of matched stopwords of language Li 
6. Find M=max (Kj) 





7.Language L is identified as the one with largest M value. 


3. DATA SET 

Transliterated text of stop words of languages Tamil, Malayalam, English were collected. Sample of 
more than 1000 stop words in each languages were used. Transliteration of stop words are used to train the 
model. Cross linguistic input text was collected from Twitter, Facebook and Whatsapp. 


Table 1. Sample stop word samples of English, Tamil and Malayalam 


me 
my 
myself 
we 


ours 
ourselves 
you 
you're 
you've 
you'll 
you'd 
Your 
Yours 
yourself 
yourselves 
he 

him 

his 
himself 
she 

she's 

her 

hers 
herself 
it 

LES 

its 
itself 
they 


oru 
endru 
matt rum 
indha 
idhu 
Naalai 
endra 
kondu 
enbadhu 
pala 
aaqum 
alladhu 
ayvar 
naan 
ulla 
andha. 
ivar 
ena 
mudhal 
enna 
irundhu 
sila 
en 
pondra 
vyendum 
vYandhu 
idhan 
adhu 
ayvan 
thaan 
palarum 


aksharam 
oru 
Pparanju 
enna 
roopa 
Sarkar 
Ssammanam 
bharya 
adheham 
thanne 
sSamsthana 
keralam 
makkal 
ninnu 
nair 
Vare 
cheythu 
muthal 
dey 
puthiya 
uthgadanam 
Manthri 
ennal 

aa 
mathram 
innu 
kottayam 
ninnum 
kuduthal 
ippol 
eppol 


them ermal niryathanayi 





4. EXPERIMENTAL RESULTS 

Program was executed for basically two kinds of input. One input with pure Malayalam and Tamil 
text written in English or can be called as transliterated text of English and Tamil. Other one with combination 
of two languages. Output was compared with actual results to record the performance index. 
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Run LangDetector-StopWords %- + |Run © LlangDetector-StopWords %- +. 
—@ 
gut =e Enter the Text for detecting Language saco 
=) Enter the Text for detecting Language =. aes ag oct et = ad une : 
— - ee —- rendu side sandai podureanga. oru 
oo mt Hospitalil aéeyippoyl Adutha divesam on fl Lee . 3 
— fe coor - : r — cinemavukeaaga sandal poduvathu seri 
— vallom poyl kaanananm. kuttoos nammalod > oy am _ eI a 
~ ud “* : : x = 2ila. vetri thoivi patri rasigergei 
LO e] rodu nks okke nju. Support , ‘ : - ‘ ; = 
ct e as ne fae Pea yee | x kavalipada thevai illai. atheal producer 
x< =. | mattrum distributor parkatum 
> Stop Word Ratios in Input text is as below : Stop Word Ratios in Input text is as below 
. english - 0 english = 0 
malayalam - 4 malayalam -1 
tamil = l tamil a “SB 
I etect ‘ 1 ralam. he Detected Language is tamil. 
The Detected Language is malayalam The Detected t 1 
Process finished with exit code 0 Process finished with exit code Oo 


Figure |. Transliterated input text of pure Malayalam — Figure 2. Transliterated input text of pure Tamil 














} Run! LangDetector-Stop Words G- i Bun! LangDetector-StopWords G@- i 
> I 
= ee — = 
|=#) Enter the Text for detecting Language 
on || (ee No I think you had the last laugh. ee Setar the Test for devectiag Lamgeage 
| = - — ; a ; : a 
eS = Avasanam vegam nandi paranju orotta : i Naalai 12th Result Coming.So, ifth Exam 
SI 7 a, {a7 
re ba ottam aayirunnallo.. BE ft Eluthiya Namma @imSrathisid , 
- Ts : cad = 
x bel Stop Word Ratios in Input text is as below P 4 en en ee ee ee 
enqlish i: > All The Best..Dont 4qet Treat 
7? pa = Stop Word Ratios in Input text is as below 
aa Malayalam - 2 : 
| ; : english - 3 
tamil - 0 malayalam - 0 
The Detected Language is english. tamil _3 


The Detected Language is english. 
Process finished with exit code 0 


Process finished with exit code Oo 


Figure 3. Transliterated input text with English with Figure 4. Transliterated input text with English 
Tamil mixed with Malayalam 


5. CONCLUSION 

Performance of the algorithm is satisfactory since 80% of accurate results predicted proved correct. 
Performance of the algorithm depends on extensive list of stop words. In Combinational sentences correctness 
proved to be the least since language detection is purely based on whether word is present in stop word list or 
not. With the list of appropraiate stops words this work can be extended to other native languages in India. 
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